The proteo[X] family currently contains members of
The document is mainly about proteoQ.
Tandem mass tag (TMT) and label-free quantitaion (LFQ) have been commonly applied in mass spectrometry (MS)-based quantification of proteins and peptides. The proteoQ tool is designed for the mining of proteomics data. It takes database search results from various search engines.
The tool interacts with an Excel spread sheet (or .csv)
for dynamic sample selections, aesthetics controls and statistical
modelings. It further integrates operations, such as data normalization,
against chosen rows and/or columns of data into functions at the users’
interface. The arrangements allow users to put ad hoc
manipulation of data behind the scene and instead apply metadata to
openly address biological questions using various data pre-processing
and informatic tools. In addition, the entire workflow is documented and
can be conveniently reproduced upon revisiting.
The informatic framework of proteoQ consists of data processing and informatics analysis. It first processes the peptide spectrum matches (PSM) tables from the search engines of Mascot, MaxQuant, MSFragger, proteoM or Spectrum Mill, for TMT (\(\le\) 16-plex), LFQ or SILAC experiments, using mass analyzers of Thermo’s Orbitrap or Bruker’s timsTOF. Peptide and protein results are then produced with users’ selection of parameters in data filtration, alignment and normalization. The package further offers a suite of tools and functionalities in statistics, informatics and data visualization by creating ‘wrappers’ around published R routines.1
(Click Recent Posts for additional examples.)
(Click here to render a html version of the README.)
To install this package, start R (latest version) and enter:
if (!requireNamespace("devtools", quietly = TRUE))
install.packages("devtools")
devtools::install_github("qzhang503/proteoQ")
# Optional data package
devtools::install_github("qzhang503/proteoQDA")or to use a daily version:
devtools::install_github("qzhang503/proteoQ@QZ_20190930")In this document, I first illustrate the following applications of proteoQ:
The data set we will use in this section corresponds to the
proteomics data from Mertins et al. (Martins
2018). In the study, two different breast cancer subtypes, triple
negative (WHIM2) and luminal (WHIM16), from patient-derived xenograft
(PDX) models were assessed by three independent laboratories. At each
site, samples from WHIM2 and WHIM16 were each split and labeled with
10-plex TMT at equal sample sizes and repeated on a different day. This
results in a total of 60 samples labeled under six 10-plex TMT
experiments. The samples under each 10-plex TMT were fractionated by
off-line, high pH reversed-phase (Hp-RP) chromatography, followed by
on-line LC/MS analysis. The MS data were analyzed against the search
engines of Mascot, MaxQuant, MSFragger, proteoM
or Spectrum Mill. Ten percent of the PSM entries were sampled randomly
from the complete data sets and stored in a companion package,
proteoQDA. For simplicity, only TMT examples will be
discussed in the document. Examples of LFQ analysis can be found in the
help document of proteoQ::load_expts, which is another good
stop for exploring proteoQ.
The data packages, proteoQDA, should have been made available through the proteoQ installation.2
RefSeq databases of human and mouse were used in the MS/MS searches against the WHIM data sets. To properly annotate protein entries with proteoQ, we would need the fasta file(s) that were used in the database searches.3 In the example below, we copy over the corresponding fasta files from the proteoQDA to a database folder:
library(proteoQDA)
copy_refseq_hs("~/proteoQ/dbs/fasta/refseq")
copy_refseq_mm("~/proteoQ/dbs/fasta/refseq")The data processing begins with PSM table(s) from Mascot, MaxQuant, MSFragger, proteoM or Spectrum Mill with the following compilation in file names:
F, followed by digits and
ends with .csv;msms and end with
.txt;psm and end with
.tsv;psmQ and end with
.txt;PSMexport and end with
.ssv.The corresponding PSMs are available through one of the followings
copy_ utilities:
# Mascot
copy_mascot_gtmt()
# or MaxQuant
copy_maxquant_gtmt()
# or MSFragger
copy_msfragger_gtmt()
# or proteoM
copy_proteom_gtmt()
# or Spectrum Mill
# (temporarily unavailable)To illustrate, I copy over Mascot PSMs to a working directory,
dat_dir:
dat_dir <- "~/proteoQ/examples"
copy_mascot_gtmt(dat_dir)The workflow involves an Excel (or .csv) template
containing the metadata of multiplex experiments, including experiment
numbers, TMT channels, LC/MS injection indexes, sample IDs, reference
channels, RAW MS data file names and additional fields from users. The
default file name for the experimental summary is
expt_smry.xlsx. If samples were fractionated off-line prior
to LC/MS, a second Excel template will also be filled out to link
multiple RAW MS file names that are associated to the same sample IDs.
The default file name for the fractionation summary is
frac_smry.xlsx.4 Unless otherwise mentioned, we will assume
these default file names throughout the document.
Columns in the expt_smry.xlsx are approximately divided into the
following three tiers: (1) essential, (2)
optional default and (3) optional open. We
supply the required information of the TMT experiments under the
essential columns. The optional default columns serve as the fields for
convenient lookups in sample selection, grouping, ordering, aesthetics
etc. For instance, the program will by default look for values under the
Color column if no instruction was given in the color
coding of a PCA plot. The optional open fields on the other hand allow
us to define our own analysis and aesthetics. For instance, we may
openly define multiple columns of contrasts at different levels of
granularity for uses in statistical modelings. For details, see the help
via ?proteoQ::load_expts().
We next copy over a pre-compiled expt_smry.xlsx and a frac_smry.xlsx to the working directory:
copy_exptsmry_gtmt(dat_dir)
copy_fracsmry_gtmt(dat_dir)We now have all the pieces that are required by proteoQ in place.
Let’s have a quick glance at the expt_smry.xlsx file. We note that no
reference channels were indicated under the column
Reference. With proteoQ, the log2FC of each species in a
given sample is calculated either (a) in relative to the
reference(s) within each multiplex TMT experiment or (b) to the
mean of all samples in the same plex if reference(s) are absent. Hence,
the later approach will be employed to the exemplary data set that we
are working with. In this special case, the mean(log2FC)
for a given species in each TMT experiment is averaged from five
WHIM2 and five WHIM16 aliquots, which are
biologically equivalent across TMT experiments.
As a final step of the setup, we will load the experimental summary into a work space:
library(proteoQ)
load_expts("~/proteoQ/examples")PSMs are MS/MS events that lead to peptide identification at certain confidence levels. The evidences in PSMs can then be summarized to peptide and protein findings using various descriptive statistics. In this section, we will apply proteoQ to summarize PSM data into peptide and protein reports.
We start the section by processing the PSM files from Mascot searches:
# columns keys in PSM files suitable for varargs of `filter_`
normPSM(
group_psm_by = pep_seq_mod,
group_pep_by = gene,
fasta = c("~/proteoQ/dbs/fasta/refseq/refseq_hs_2013_07.fasta",
"~/proteoQ/dbs/fasta/refseq/refseq_mm_2013_07.fasta"),
rptr_intco = 1000,
rm_craps = TRUE,
rm_krts = FALSE,
rm_outliers = FALSE,
annot_kinases = TRUE,
plot_rptr_int = TRUE,
plot_log2FC_cv = TRUE,
filter_psms = rlang::exprs(pep_expect <= .1, pep_score >= 15),
filter_more_psms = rlang::exprs(pep_rank == 1),
)At group_psm_by = pep_seq, PSM entries with the same
primary peptide sequence but different variable modifications will be
grouped for analysis using descriptive statistics. In case
group_psm_by = pep_seq_mod, PSMs will be grouped
alternatively according to the unique combination of the primary
sequences and the variable modifications of peptides. Analogously,
group_pep_by specify the grouping of peptides by either
protein accession names or gene names. The fasta argument
points to the location of a copy of the RefSeq fasta files that were
used in the corresponding MS/MS searches. More description of
normPSM can be found by accessing its help document via
?normPSM.
Every time the normPSM module is executed, it will
process the PSM data from the ground up. In other words, it has no
memory on prior happenings. For instance, after inspecting graphically
the intensity distributions of reporter ions at
plot_rptr_int = TRUE, we may reconsider a more inclusive
cut-off at rptr_intco = 100. The downward in
rptr_intco (from 1,000 to 100) is not going to
cause information loss. In words, there is no data loss in the range of
100 to 1,000 in reporter-ion intensities. This is trivia but worth
mentioning here. As we will find out in following sections, utilities in
peptide and protein normalization, standPep and
standPrn, do pass information onto successive iterations
and can lead to interesting features such as mixed-bed normalization of
data etc.
For experiments that are proximate in the quantities of input
materials, there might still be unprecedented events that could have
caused dipping in the ranges of reporter-ion intensity for certain
samples. With proper justification, we might consider excluding the
outlier samples from further analysis. The sample removal and PSM
re-processing can be achieved by simply deleting the corresponding
entries under the column Sample_ID in expt_smry.xlsx,
followed by the re-execution of normPSM() (See additional
notes on data
exclusion and metadata
for LFQ).
There is a subtle problem when we choose to remove PSM outliers at
rm_outliers = TRUE. Note that PSM outliers will be assessed
at a per-peptide-and-per-sample basis, which could be a slow process for
large data sets. To circumvent repeated efforts in finding PSM outliers,
we may initially set rm_outliers = FALSE and
plot_rptr_int = TRUE when executing normPSM().
This will allow us to first decide on an ultimate threshold of
reporter-ion intensity, before proceeding to the more time-consuming
procedure in PSM outlier removals.
The normPSM function can take additional, user-defined
arguments of dot-dot-dot (see
Wickham 2019, ch. 6) for the row filtration of data using logical
conditions. In the above example, we have limited ourselves to PSM
entries with pep_expect <= 0.1 and
pep_score >= 15 by supplying the variable argument
(vararg) of filter_psms_at. We further filtered the data at
pep_rank == 1 with another vararg of
filter_psms_more. It makes no difference whether we put the
conditions in one or multiple statements:
normPSM(
filter_psms_at = rlang::exprs(pep_expect <= .1, pep_score >= 15, pep_rank == 1),
...,
)The creation and assignment of varargs need to follow a format of
filter_blahblah = exprs(cdn1, cdn2, ..., cdn_last). Note
that the names of varargs on the lhs start with the character string of
filter_ to indicate the task of data filtration. On the
rhs, pep_expect, pep_score and
pep_rank are column keys that can be found from the Mascot
PSM data. Backticks will be needed for column keys containing white
space(s) and/or special character(s):
`key with space (sample id in parenthesis)`. Analogously,
we can apply the vararg approach to the PSMS from MaxQuant,
MSFragger and Spectrum Mill:
# `PEP` and `Mass analyzer` are column keys in MaxQuant PSM tables
normPSM(
filter_psms_at = rlang::exprs(PEP <= 0.1, `Mass analyzer` == "FTMS"),
...,
)
# `Expectation` and `Retention` are column keys in MSFragger PSM tables
normPSM(
filter_psms_at = rlang::exprs(Expectation <= 0.1, Retention >= 50),
...,
)
# `score` is a column key in Spectrum Mill PSM tables
normPSM(
filter_psms_at = rlang::exprs(score >= 10),
...,
)Direct assignments of logical conditions to arguments in
R functions might be a little involved. To get around this,
I took advantage of the facility of non-standard evaluation in
rlang package in that the logical conditions are supplied
within the round parenthesis after exprs. Next, the proteoQ
program will obtain the expression(s) on the rhs of each vararg
statement by performing a bare evaluation using
rlang::eval_bare. Following that, a tidy evaluation by
rlang::eval_tidy will be coupled to a local facility in
proteoQ to do the real work of data filtrations ((see Wickham 2019, ch. 20)).
The approach of data filtration taken by normPSM might
at first looks strange; however, it allows me to perform data filtration
in an integrated way. As mentioned in the beginning, one of the key
themes of proteoQ is to reduce or avoid direct data manipulations but
utilizes metadata to control both data columns and rows. With the
self-containedness in data filtration (and data ordering later), I can
readily recall and reproduce what I had done when revisiting the system
after an extended period. Otherwise, I would likely need ad hoc
operations by mouse clicks or writing ephemeral R scripts, and soon
forget what I have done.
Moreover, the build-in approach can serve as building blocks for more
complex data processing. As shown in the help documents via
?standPep and ?standPrn, we can readily
perform mixed-bed normalization by sample groups, against either full or
partial data.
With normPSM, we can pretty much filter_
data under any PSM columns we like. In the above Mascot example, I have
chosen to filter PSM entires by their pep_expect,
pep_score etc. There is a reason for this.
Let’s first consider a different column pep_len. The
values underneath are unique to both PSMs and peptides. As you might
courteously agree, its time has not yet come in terms of
tentative data filtration by peptide length. In other words, we can
delay the filtration of peptide entries by their sequence lengths when
we are actually working with peptide data. The summarization of PSMs to
peptides is not going to change the number of amino acid residues in
peptides. By contrast, the data under pep_expect are unique
to PSMs, but not necessary to peptides. This is obvious in that each of
the PSM events of the same peptide is likely to have its own confidence
expectation in peptide identification. Therefore, if we were to filter
data by their pep_expect values at a later stage of
analysis, we would have lost the authentic information in
pep_expect for peptides with multiple PSM identifications.
More specifically, the values under pep_expect in peptide
tables are the geometric-mean representation of PSM results (see also
section 4).
For this reason, I named the varargs filter_psms_at and
filter_psms_more in the above normPSM
examples. This allows me to quickly grasp the action that I was
filtering data based on criteria that are specific to PSMs.
Vararg statements of filter_ and arrange_
are available in proteoQ for flexible filtration and ordering of data
rows. In addition, there are filter2_ and
arrange2_. As indicated by their names,
filter_ and filter2_ perform row filtration
against column keys from a primary data file, df, and
secondary data file(s), df2, respectively (df
and df2 defined here).
The same correspondence is applicable for arrange_ and
arrange2_ varargs.
Users will typically employ either primary or secondary vararg
statements, but not both. In the more extreme case of
gspaMap(...), it links prnGSPA(...) findings
in df2 to the significance p-values and abundance fold
changes in df for volcano plot visualization by gene
sets.
To finish our discussion of PSM processing, let us consider having
one more bash in data cleanup. The corresponding utility is
purgePSM. It performs data purging by the CV of peptides,
measured from contributing PSMs within the same sample. Namely,
quantitations that have yielded peptide CV greater than a user-supplied
cut-off will be replaced with NA.
The purgePSM utility reads files
/PSM/TMTset1_LCMSinj1_PSM_N.txt etc. from a preceding step
of normPSM and updates the files accordingly.5 To revert
programmatically the data nullification made by purgePSM
(should there be any), we would need to start over with
normPSM. Alternatively, we may make a temporary copy of
these files for a probable undo.
This process takes place sample (column)-wisely while holding the
places for data points that have been nullified. It is different to the
above row filtration processes by filter_ in that there is
no row removals with purging.
Earlier in section 1.2.1, we have set
plot_log2FC_cv = TRUE by default when calling
normPSM. This will plot the distributions of the CV of
peptide log2FC. In the event of plot_log2FC_cv = FALSE, we
can have a second chance in visualizing the distributions of peptide CV
before any permanent data nullification:
purgePSM ()Taking the sample entries under TMT_Set one and
LCMS_Injection one in label_scheme.xlsx as an example, we
can see that a small portion of peptides have CV greater than 0.5 at
log2 scale (Figure 1A).
Figure 1A-1C. CV of peptide log2FC (based on full data set). Left: no CV cut-off; middle: CV cut-off at 0.5; right: CV cut-off at 95 percentile.
Quantitative differences greater than 0.5 at a log2 scale is relatively large in TMT experiments 6, which can be in part ascribed to a phenomenon called peptide co-isolation and co-fragmentation in reporter ion-based MS experiments. We might, for instance, perform an additional cleanup by removing column-wisely data points with CV greater than 0.5 (Figure 1B):
purgePSM (
max_cv = 0.5,
)The above method using a flat cut-off would probably fall short if the ranges of CV are considerably different across samples (see Lab 3.1). Alternatively, we can remove low-quality data points using a CV percentile, let’s say at 95%, for each sample (Figure 1C):
# copy back `\PSM\TMTset1_LCMSinj1_PSM_N.txt` etc. before proceed
# otherwise the net effect will be additive to the prior(s)
purgePSM (
pt_cv = 0.95,
)In the event of both pt_cv and max_cv being
applied to nullify data, they follow the precedence of
pt_cv > max_cv. When needed, we can overrule the default
by executing purgePSM sequentially at a custom order:
# at first no worse than 0.5
purgePSM (
max_cv = 0.5,
)
# next `pt_cv` on top of `max_cv`
purgePSM (
pt_cv = 0.95,
)The data purge is also additive w.r.t. to repetative analysis. In the following example, we are actually perform data cleanup at a CV threshold of 90%:
# at first 95%
purgePSM (
pt_cv = 0.95,
)
# next 95% of 95%
purgePSM (
pt_cv = 0.95,
)While multiple PSMs carry information about the precision in peptide
measures, the above single-sample variance does not inform sampling
errors prior to peptide separations. For instance, the same peptide
species from a given sample remain indistinguishable/exchangeable prior
to the off-line fractionation. As a result, the CV shown by
normPSM or purgePSM mainly tell us the
uncertainty of measures beyond the point of peptide parting.
NB: CV is sensitive to outliers and some large CV in peptide
quantitations may be merely due to a small number of aberrant measures
(see George Casella and Roger L Berger 2002
ch. 10 for quantitative descriptions). Although the option of
rm_outliers was set to FALSE during our
earlier call to normPSM, I think it is generally a good
idea to have rm_outliers = TRUE.
Up to this point, we should have arrived at a set of common
column names in the PSM outputs of
PSM/TMTset1_LCMSinj1_PSM_N.txt etc., across various search
engines (for both TMT and LFQ). To continue, we will summarize the PSM
results to peptides with PSM2Pep, mergePep,
standPep and optional purgePep.
The utility for the summary of PSMs to peptides is
PSM2Pep:
PSM2Pep()It loads the above mentioned PSM tables from the preceding
normPSM procedure, and summarize them to peptide data using
various descriptive statistics (see also Section 4). For
intensity and log2FC data, the summarization
method may be specified by argument method_psm_pep, with
median being the default.
Data filtration via varargs is also available with
PSM2Pep. For instance, we might consider filter the data by
the MS1 intensities of peptide matches, which are indicated under the
column pep_tot_int. However, you probably don’t want to try
the following codes with the Mascot example that we are currently
using:
PSM2Pep(
filter_ms1int = rlang::exprs(pep_tot_int >= 1E4)
)To make it functionally valid, we would need to include the
Raw peptide match data when exporting Mascot
PSMs.
Following the summarization of PSMs to peptides, the utility
mergePep will assemble individual peptide tables,
Peptide/TMTset1_LCMSinj1_Peptide_N.txt,
TMTset1_LCMSinj2_Peptide_N.txt etc., into one larger piece,
Peptide.txt.
mergePep(
filter_peps_by = rlang::exprs(pep_len <= 100),
)Similar to normPSM and PSM2Pep, we could
again filter data via column keys linked to the varargs of
filter_. In the exemplary vararg statement of
filter_peps_by, we exlcude longer peptide sequences with
more than 100 amino acid residues. If we are interested in human, but
not mouse, peptides from the pdx samples, we can specify similarly that
species == "human". Sometimes, it may remain unclear on
proper data filtration at the early stage of analysis. In that case, we
may need additional quality assessments that we will soon explore.
Alternatively, we may keep as much information as possible and apply
varargs in downstream analysis.
The utility standPep standardizes peptide results from
mergePep with additional choices in data alignment.
standPep(
range_log2r = c(10, 90),
range_int = c(5, 95),
method_align = MGKernel,
n_comp = 3,
seed = 749662,
maxit = 200,
epsilon = 1e-05,
)The parameters range_log2r and range_int
outline the ranges of peptide log2FC and reporter-ion
intensity, respectively, for use in defining the CV and scaling the
log2FC across samples. The log2FC of peptide data will be aligned by
median centering across samples by default. If
method_align = MGKernel is chosen, log2FC will be aligned
under the assumption of multiple Gaussian kernels.7 The companion
parameter n_comp defines the number of Gaussian kernels and
seed set a seed for reproducible fittings. Additional
parameters, such as, maxit and epsilon, are
defined in and for use with normalmixEM.
It is also feasible to perform standPep against defined
sample columns and data rows. Moreover, the utility can be applied
interactively with cumulative effects. Combinations and iterations of
the features can lead to specialty sample alignments that will discuss
soon (sections 1.3.5 - 1.3.7). Before delving more into the details, we
would probably need some helps from the pepHist utility in
the immediately following.
The pepHist utility plots the histograms of peptide
log2FC. It further bins the data by their contributing reporter-ion or
LFQ intensity. In the examples shown below, we compare the
log2FC profiles of peptides with and without scaling
normalization:8
# without scaling
pepHist(
scale_log2r = FALSE,
ncol = 10,
)
# with scaling
pepHist(
scale_log2r = TRUE,
ncol = 10,
)By default, the above calls topepHist will look for none
void entries under column Select in expt_smry.xlsx. This
will results in histogram plots with 60 panels in each, which may not be
easy to explore as a whole. In stead, we will break the plots down by
their data origins. We begin with modifying the expt_smry.xlsx file by
adding the columns BI_1, JHU_1 etc. Each of
the new columns includes sample entries that are tied to their
laboratory origins and TMT batches (the columns are actually already in
the expt_smry.xlsx).
We now are ready to plot histograms for each subset of the data. In
this document, we only display the plots using the BI_1
subset:
# without scaling
pepHist(
scale_log2r = FALSE,
col_select = BI_1,
ncol = 5,
filename = bi1_n.png,
)
# with scaling
pepHist(
scale_log2r = TRUE,
col_select = BI_1,
ncol = 5,
filename = bi1_z.png,
)NB: We interactively told pepHist() that we are
interested in sample entries under the newly created BI_1
column. Behind the scene, the interactions occurred via the reading of
the Setup workbook in expt_smry.xlsx. We also supply a file
name, assuming that we want to keep the previously generated plots with
default file names of Peptide_Histogram_N.png and
Peptide_Histogram_Z.png.
Figure 2A-2B. Histograms of peptide log2FC. Top:
scale_log2r = FALSE; bottom,
scale_log2r = TRUE
As expected, both the widths and the heights of log2FC profiles
become more comparable after the scaling normalization. However, such
adjustment may cause artifacts when the standard deviation across
samples are genuinely different. I typically test
scale_log2r at both TRUE and
FALSE, then make a choice in data scaling together with my
a priori knowledge of the characteristics of both samples and
references.9 We will use the same data set to illustrate
the impacts of reference selections in scaling normalization in Lab 3.1.
It should also be noted that the curves of Gaussian density in
histograms are calculated during the latest call to
standPep(...) with the option of
method_align = MGKernel. There is a useful side effect when
comparing leading and lagging profiles of log2FC. In the following
bare-bones example, we align differently the peptide log2FC with the
default method of median centering:
standPep()We then visualize the histograms of the ratio profiles (Figure 2C):
pepHist(
scale_log2r = TRUE,
col_select = BI_1,
ncol = 5,
filename = bi1_z_mc.png,
)Within this document, the preceding example that involves
standPep(...) at method_align = MGKernel is
given in section 1.3.3. In this case, a comparison between the present
and the prior histograms will reveal the difference in ratio alignments
between a median centering and a three-Gaussian assumption. More
examples in the side effects can be found from the help document via
?standPep and ?pepHist.
Figure 2C-2D. Histograms of peptide log2FC. Top:
median-centering for all samples; bottom: W2.BI.TR2.TMT1
aligned differently by Gaussian density
The varargs of filter_ are also available in the
pepHist utility. With the following examples, we can
visualize the peptide log2FC with human and mouse
origins, respectively:
pepHist(
scale_log2r = TRUE,
col_select = BI_1,
ncol = 5,
filter_by_sphu = rlang::exprs(species == "human"),
filename = hs.png,
)
pepHist(
scale_log2r = TRUE,
col_select = BI_1,
ncol = 5,
filter_by_spmm = rlang::exprs(species == "mouse"),
filename = mm.png,
)Now that we have been acquainted with pepHist, let’s
revisit and explore additionally standPep with its features
in normalizing data against defined sample columns (and data rows in the
following sections).
Needs in data (re)normalization may be encountered more often than
not. One of the trivial circumstances is that a multi-Gaussian kernel
can fail capturing the log2FC profiles for a subset of samples. This is
less an issue with a small number of samples. Using a trial-and-error
approach, we can start over with a new combination of parameters, such
as a different seed, and/or a different range of
range_log2r etc. However, the one-size-fit-all attempt may
remain inadequate when the number of samples is relatively large. With
proteoQ, we can chose to focus fit against selected samples.
This is again the job of argument col_select. Let’s say we
want to re-fit the log2FC for samples W2.BI.TR2.TMT1 and
W2.BI.TR2.TMT2. We simply add a column, which I named it
Select_sub, to expt_smry.xlsx with the sample entries for
re-fit being indicated under the column:
We may then execute the following codes with argument
col_select being linked to the newly created column:
standPep(
method_align = MGKernel,
range_log2r = c(10, 90),
range_int = c(5, 95),
n_comp = 3,
seed = 749662,
maxit = 200,
epsilon = 1e-05,
col_select = Select_sub,
)
pepHist(
scale_log2r = TRUE,
col_select = BI_1,
ncol = 5,
filename = mixed_bed_3.png,
)In the preceding execution of bare-bones standPep(),
samples were aligned by median centering (Figure 2C).
As expected, the current partial re-normalization only affects samples
W2.BI.TR2.TMT1 and W2.BI.TR2.TMT2
(Figure 2D, W2.BI.TR2.TMT2 not shown). In
other words, samples W2.BI.TR2.TMT1 and
W2.BI.TR2.TMT2 are now aligned by their Gaussian densities
whereas the remaining are by median centering. The combination allows us
to align sample by mixed-bedding the MC or the
MGKernel method.
We have previously applied the varargs of filter_ in
normPSM and mergePep to subset data rows. With
this type of arguments, data entries that have failed the filtration
criteria will be removed for indicated analysis.
Similarly, we employed the filter_ varargs in
pepHist to subset peptides with human or mouse origins
(section 1.3.4.3). This is often not an issue in informatic analysis and
visualization, as we do not typically overwrite the altered inputs on
external devices at the end. Sometimes we may however need to carry out
similar tasks based on partial inputs and update the complete set of
data for future uses. One of the circumstances is model parameterization
by a data subset and to apply the finding(s) to update the complete
set.
The standPep utility accepts variable arguments of
slice_. The vararg statement(s) identify a subset of data
rows from the Peptide.txt. The partial data will be taken
for parameterizing the alignment of log2FC across samples. In the
hypothetical example shown below, we normalize peptide data based
peptide entries with sequence lengths greater than 10 and smaller than
30. The full data set will be updated accordingly with the newly derived
parameters. Different to the filter_ varargs, there is
no data entry removals from the complete data set with the
slice_ procedure.
## DO NOT RUN
standPep(
...,
slice_peps_by = rlang::exprs(pep_len > 10, pep_len < 30),
)
## END of DO NOT RUNThe varargs are termed slice_ to make distinction to
filter_. Although it might at first seem a little involved,
the underlying mechanism is simple: col_select defines the
sample columns and slice_ defines the data
rows in Peptide.txt; and only the intersecting
area between columns and rows will be subject additively to the
parameterization in data alignment. The same pattern will be applied
every time we invoke standPep.
Just like col_select and filter_ in
pepHist, the combination in fixed argument
col_select and variable argument
slice_ can lead to features in versatile data processing.
Several working examples are detailed and can be accessed via
?standPep and ?standPrn.
Now it becomes elementary if we were to normalize data against a set
of housekeeping protein(s). Let’s say we have GAPDH in mind
as an accurate housekeeping invariant among the proteomes, and of course
we are confident about the good accuracy in their log2FC. We simply
slice the peptide entries under GAPDH out for
use as a normalizer:
standPep(
method_align = MC,
range_log2r = c(10, 90),
range_int = c(5, 95),
col_select = Select_sub,
slice_hskp = rlang::exprs(gene %in% c("GAPDH")),
)
pepHist(
scale_log2r = TRUE,
col_select = BI_1,
ncol = 5,
filename = housekeepers.png,
)Note that I chose method_align = MC in the above. There
are only a few rows available for the samples linked to
col_select, after slicing out GAPDH! The number of data
points is too scare for fitting the selected samples against a
3-component Gaussian. A more detailed working example can also be found
via ?standPep where you would probably agree that GAPDH is
actually not a good normalizer for the data set.
Analogously to the PSM processing, we may nullify data points of peptides by specifying a cut-off in their protein CVs:
# no purging
purgePep()
# or purge column-wisely by max CV
purgePep (
max_cv = 0.5,
filename = "by_maxcv.png",
)
# or purge column-wisely by CV percentile
# remember the additive effects
purgePep (
pt_cv = 0.5,
filename = "by_ptcv.png",
)NB: The above single-sample CVs of proteins are based on
ascribing peptides, which thus do not inform the uncertainty in sample
handling prior to the parting of protein entities, for example, the
enzymatic breakdown of proteins in a typical MS-based proteomic
workflow. On the other hand, the peptide log2FC have been previously
summarized by the median statistics from contributing PSMs. Putting
these two together, the CV by purgePep describes
approximately the uncertainty in sample handling from the breakdown of
proteins to the off-line fractionation of peptides. (see also George Casella and Roger L Berger 2002
ch. 7.3.3 for the technical details of conditional variance.)
In this section, we summarize peptides to proteins, for example, using a two-component Gaussian kernel and customized filters.
The utility for the summary of peptides to proteins is
Pep2Prn:
Pep2Prn()It loads the Peptide.txt and summarize the peptide data
to interim protein results in Protein.txt, using various
descriptive statistics (see also Section 4). For intensity and log2FC
data, the summarization method is specified by argument
method_pep_prn, with median being the
default.
The utitily also accept varargs of filter_ for data row
filtration against the column keys in Peptide.txt.
The utility standPrn standardizes protein results from
Pep2Prn with additional choices in data alignment.
standPrn(
range_log2r = c(10, 90),
range_int = c(5, 95),
method_align = MGKernel,
n_comp = 2,
seed = 749662,
maxit = 200,
epsilon = 1e-05,
slice_prots_by = rlang::exprs(prot_n_pep >= 2),
)It loads Protein.txt from Pep2Prn or a
preceding standPrn procedure and align protein data at
users’ choices. The utility is analogous to standPep with
choices in col_select and slice_. In the above
example, the normalization is against proteins with two more identifying
peptides. For helps, try ?standPrn.
Similar to the peptide summary, we can inspect the alignment and the scale of ratio profiles for protein data:
# without scaling
prnHist(
scale_log2r = FALSE,
col_select = BI_1,
ncol = 5,
filename = bi1_n.png,
)
# with scaling
prnHist(
scale_log2r = TRUE,
col_select = BI_1,
ncol = 5,
filename = bi1_z.png,
)For simplicity, we only display the histograms with scaling normalization (Figure 2E).
Figure 2E-2F. Histograms of protein log2FC at
scale_log2r = TRUE. Left: before filtration; right, after
filtration
In section 1.3.4.2, we used pepHist to illustrate the
side effects in histogram visualization when toggling the alignment
methods between MC and MGKernel. In the
following, we will show another example of side effects using the
protein data.
We prepare the ratio histograms for proteins with ten or more quantifying peptides:
# without scaling
prnHist(
scale_log2r = FALSE,
col_select = BI_1,
ncol = 5,
filter_prots_by = rlang::exprs(prot_n_pep >= 10),
filename = bi1_n_npep10.png,
)
# with scaling
prnHist(
scale_log2r = TRUE,
col_select = BI_1,
ncol = 5,
filter_prots_by = rlang::exprs(prot_n_pep >= 10),
filename = bi1_z_npep10.png,
)The density curves are based on the latest call to
standPrn(...) with method_align = MGKernel
(Figure 2E). For simplicity, we again only show the
current plots at scale_log2_r = TRUE (Figure
2F). The comparison between the lead and the lag allows us to
visualize the heteroscedasticity in data and in turn inform new
parameters in data renormalization.
Up to this point, we might have reach a consensus on the choice of
scaling normalization. If so, it may be plausible to set the value of
scale_log2r under the Global environment, which is
typically the R console that we are interacting with.
# if agree
scale_log2r <- TRUE
# or disagree
scale_logr <- FALSEIn this way, we can skip the repetitive setting of
scale_log2r in our workflow from this point on, and more
importantly, prevent ourselves from peppering the values of
TRUE or FALSE in scale_log2r from
analysis to analysis.
Exemplary scripts are summarized in:
system.file("extdata", "workflow_tmt_base.R", package = "proteoQ")
system.file("extdata", "workflow_tmt_ext.R", package = "proteoQ")Another place to get help is via ?load_expts.
system.file("extdata", "workflow_lfq_base.R", package = "proteoQ")Additional notes available in here.
For the purpose of quick demonstrations where steps in data preprocessing were bypassed:
unzip(system.file("extdata", "demo.zip", package = "proteoQDA"),
exdir = "~/proteoq_bypass", overwrite = FALSE)
# file.exists("~/proteoq_bypass/proteoQ/examples/Peptide/Peptide.txt")
# file.exists("~/proteoq_bypass/proteoQ/examples/Protein/Protein.txt")
library(proteoQ)
load_expts("~/proteoq_bypass/proteoQ/examples")
# Exemplary protein MDS
prnMDS(
show_ids = FALSE,
width = 8,
height = 4,
)In this section I illustrate the following applications of proteoQ:
Unless otherwise mentioned, the in-function filtration
of data by varargs of filter_ is available throughout this
section of informatic analysis. Row ordering of data, indicated by
arrange_, is available for heat map applications using
pepHM, prnHM and
plot_metaNMF.
We first visualize MDS and Euclidean distance against the peptide
data. We start with metric MDS for peptide data (prnMDS for
proteins):
# all data
pepMDS(
show_ids = FALSE,
)
Figure 3A. MDS of peptide log2FC at
scale_log2r = TRUE
It is clear that the WHIM2 and WHIM16 samples are well separated by
the Euclidean distance of log2FC (Figure 3A). We next
take the JHU data subset as an example to explore batch
effects in the proteomic sample handling:
# `JHU` subset
pepMDS(
col_select = JHU,
filename = jhu.png,
show_ids = FALSE,
height = 3,
width = 8,
)
Figure 3B-3C. MDS of peptide log2FC for the
JHU subset. Left: original aesthetics; right, modefied
aesthetics
We immediately spot that all samples are coded with the same color
(Figure 3B). This is not a surprise as the values under
column expt_smry.xlsx::Color are exclusively
JHU for the JHU subset. For similar reasons,
the two different batches of TMT1 and TMT2 are
distinguished by transparency, which is governed by column
expt_smry.xlsx::Alpha. We may wish to modify the aesthetics
using different keys: e.g., color coding by WHIMs and size coding by
batches, without the recourse of writing new R scripts. One solution is
to link the attributes and sample IDs by creating additional columns in
expt_smry.xlsx. In this example, we have had coincidentally prepared the
column Shape and Alpha to code WHIMs and
batches, respectively, for the JHU subset. Therefore, we
can recycle them directly to make a new plot (Figure
3C):
# new `JHU` subset
pepMDS(
col_select = JHU,
col_fill = Shape, # WHIMs
col_size = Alpha, # batches
filename = new_jhu.png,
show_ids = FALSE,
height = 3,
width = 8,
)While MDS approximates Euclidean and other distance
measures at a low-dimensional space. Sometimes it may be useful to have
an accurate view of the distance matrix. Functions
pepEucDist and prnEucDist plot the heat maps
of Euclidean distance matrix for peptides and proteins, respectively.
Supposed that we are interested in visualizing the distance matrix for
the JHU subset:
# `JHU` subset
pepEucDist(
col_select = JHU,
annot_cols = c("Shape", "Alpha"),
annot_colnames = c("WHIM", "Batch"),
# `pheatmap` parameters
display_numbers = TRUE,
number_color = "grey30",
number_format = "%.1f",
clustering_distance_rows = "euclidean",
clustering_distance_cols = "euclidean",
fontsize = 16,
fontsize_row = 20,
fontsize_col = 20,
fontsize_number = 8,
cluster_rows = TRUE,
show_rownames = TRUE,
show_colnames = TRUE,
border_color = "grey60",
cellwidth = 24,
cellheight = 24,
width = 14,
height = 12,
filename = jhu.png,
)The graphic controls of heat maps are achieved through pheatmap
with modifications. Parameter annot_cols defines the tracks
to be displayed on the top of distance-matrix plots. In this example, we
have chosen expt_smry.xlsx::Shape and
expt_smry.xlsx::Alpha, which encodes the WHIM subtypes and
the batch numbers, respectively. Parameter annot_colnames
allows us to rename the tracks from Shape and
Alpha to WHIM and Batch,
respectively, for better intuition. We can alternatively add columns
WHIM and Batch if we choose not to recycle and
rename columns Shape and Alpha.
Figure 3D. EucDist of peptide log2FC at
scale_log2r = TRUE
The utility is currently applied to Euclidean distances with an
argument adjEucDist for a probable compensation of
distances between TMT experiments. As mentioned earlier, the
quantitative log2FC are measured in relative to the reference materials
under each multiplex TMT experiments. When concatenating data across TMT
experiments, the measurement errors may accumulate differently. Likely
the uncertainty in the reference signals will be greater if we were to
prepare the references at an earlier stage of sample handling as opposed
to a later stage. I tried to go through the most fundamental
calculations step-by-step to help myself understand the differences:
Figure 3E. Accumulation of Euclidean distance in the
interplex comparison of log2FC
The adjustment might be more suitable for studies where both the
samples and references are largely similar in proteome compositions. The
setting of adjEucDist = TRUE would discount the distances
between references when using visualization techniques such a MDS or
distance heat maps. In the cases that sample differences are exceedingly
greater than handling errors, the setting of
adjEucDist = FALSE would probably be more appropriate.
The utilities for PCA analysis are pepPCA and
prnPCA for peptide and protein data, respectively. They are
wrappers of the stats::prcomp. Data scaling and centering
are the two aspects that have been emphasized greatly in PCA analysis.
Some notes on proteoQ data scaling are available in section 3.1.1; hence
in the present section, we will focus only on trials against data being
scaled. Additional notes about data centering can be found here.
With proteoQ, the option in data scaling is set by variable
scale_log2r, which will be passed to the
scale. in stats::prcomp. For data centering,
proteoQ relays the TRUE default to
stats::prcomp.
Provided the importance of data centering in PCA and several other
analyses, proteoQ further incorporated the three columns of
prot_mean_raw, prot_mean_n and
prot_mean_z in protein outputs. The first one summarizes
the mean log2FC before data alignment for individual
proteins across selected samples. The second and the three compute the
corresponding mean log2FC after data alignment, with and
without scaling normalization, respectively (see also section 4 for
column keys). The corresponding columns summarizing the mean deviation
in peptide data are pep_mean_raw, pep_mean_n
and pep_mean_z. As usual, the sample selections can be
customized through the argument col_select.
The mean log2FC of proteins or peptides may serve as
indicators that how far a given protein or peptide species is away from
the data centering format (a.k.a. mean deviation form) that will be
enforced by default in PCA. Taking protein data as an example, we will
go through couple settings in prnPCA. At first, we
performed PCA with data centering by default (Figure
4A):
prnPCA(
col_select = Select,
show_ids = FALSE,
filename = cent.png,
)We next performed another PCA with the removals of proteins that are far from mean deviation form (Figure 5B):
# observe that the overall deviations from "mean zero" may not be symmetric
prnPCA(
col_select = Select,
show_ids = FALSE,
filter_prots_by = rlang::exprs(prot_mean_z >= -.25, prot_mean_z <= .3),
filename = sub_cent.png,
)Note that the clusterings are tightened under each sample type of W2
or W16 after the filter_prots_by filtration. Further note
that the proportion of variance explained in the first
principal axis decreased from 57.5% to 55.4% after the data filtration.
This suggests that the entries deviating the most from mean
zero are more leveraging towards the explained variance,
even with data centering. In other words, high deviating entries are in
general associated with above-average data variance, in relative to the
entire data set. The observation also indicates that a high value of
proportion of variance explained may not necessary be a go-to
standard for differentiating sample types in that variance may be
sensitive to leveraging data points.
Figure 4A-4B. PCA of protein log2FC with data centering
on. Left: without filtration; right, with filtration
We next explore the analogous, but by turning off data centering:
prnPCA(
col_select = Select,
center = FALSE,
show_ids = FALSE,
filename = nocent.png,
)
prnPCA(
col_select = Select,
center = FALSE,
show_ids = FALSE,
filter_prots_by = rlang::exprs(prot_mean_z >= -.25, prot_mean_z <= .3),
filename = sub_nocent.png,
)First note that there is no labels of the proportion of variance explained since such a view of variance is often not suitable without data centering. Instead, an interpretation as square Euclidean distance would be more appropriate.
Further note the wider spread in PC1 and narrower in PC2 for the analysis without the removal of high deviation entries (Figure 4C versus 4D). The driving force for the difference may be again ascribed to the more leveraging data entries. Intuitively speaking, the high leverage points tend to associate with higher-than-normal Euclidean distance. This becomes more evident after the removals of the high deviation entries (Figure 4D).
The above showcases that the choice in data centering can lead to different interpretation in biology, which may be in part ascribed to high deviation entries. The phenomena can, however, be conveniently explored via proteoQ.
Figure 4C-4D. PCA of protein log2FC. Left: data
centering off without filtration; right, data centering
off with filtration
The y-labels in Figure 4C are not well separated.
This can be fixed by providing a custom theme to prnPCA
(see also the help document via ?prnPCA). Alternatively, we
may export the PCA results for direct ggplot2:
res <- prnPCA(
col_select = Select,
center = FALSE,
show_ids = FALSE,
filename = foo.png,
)
# names(res)
library(ggplot2)
my_theme <- theme_bw() + theme(
axis.text.x = element_text(angle=0, vjust=0.5, size=20),
axis.text.y = element_text(angle=0, vjust=0.5, size=20),
axis.title.x = element_text(colour="black", size=20),
axis.title.y = element_text(colour="black", size=20),
plot.title = element_text(face="bold", colour="black", size=20, hjust=0.5, vjust=0.5),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
legend.key = element_rect(colour = NA, fill = 'transparent'),
legend.background = element_rect(colour = NA, fill = "transparent"),
legend.title = element_blank(),
legend.text = element_text(colour="black", size=14),
legend.text.align = 0,
legend.box = NULL
)
p <- ggplot(res$pca) +
geom_point(aes(x = PC1, y = PC2, colour = Color, shape = Shape,
alpha = Alpha), size = 4, stroke = 0.02) +
scale_y_continuous(breaks = seq(5, 15, by = 5)) +
labs(title = "", x = paste0("PC1 (", res$prop_var[1], ")"), y = paste0("PC2 (", res$prop_var[2], ")")) +
coord_fixed() +
my_theme
ggsave(file.path(dat_dir, "Protein/PCA/nocent_2.png"), width = 6, height = 4)Figure 4E. Custom plot.
The PCA findings at higher dimensions may be visualized via pairwise plots between principal components.
prnPCA(
show_ids = FALSE,
rank. = 4,
dimension = 3,
filename = d3.png,
)Figure 4F. Higher dimensions.
Additional examples and analogous high-dimension MDS can be found
from the help documents via ?prnPCA and
?prnMDS, respectively.
See notes here.
In this section, we visualize the batch effects and biological
differences through correlation plots. The proteoQ tool currently limits
itself to a maximum of 44 samples for a correlation plot. In the
document, we will perform correlation analysis against the
PNNL data subset. By default, samples will be arranged by
the alphabetical order for entries under the column
expt_smry.xlsx::Select. We have learned from the earlier
MDS analysis that the batch effects are smaller than the
differences between W2 and W16. We may wish to
put the TMT1 and TMT2 groups adjacent to each
other for visualization of more nuance batch effects, followed by the
comparison of WHIM subtypes. We can achieve this by supervising sample
IDs at a customized order. In the expt_smry.xlsx, We have prepared an
Order column where samples within the JHU
subset were arranged in the descending order of W2.TMT1,
W2.TMT2, W16.TMT1 and W16.TMT2.
Now we tell the program to look for the Order column for
sample arrangement:
# peptide logFC
pepCorr_logFC(
col_select = PNNL,
col_order = Order,
filename = pep_pnnl.png,
)
# protein logFC
prnCorr_logFC(
col_select = PNNL,
col_order = Group,
filename = prn_pnnl.png,
)